Skip to content

Conversation

@sixpluszero
Copy link
Contributor

@sixpluszero sixpluszero commented Nov 6, 2025

[server] Add a heartbeat lag based replica auto resubscribe feature

This is an optional feature that aims to improve consumption resiliency. When we observe current version resource heartbeat delay grows beyond certain threshold, it could potentially be that underlying consumer is not functioning well. While we are also working on figuring out the root cause of that, this feature aims to mitigate it in the runtime by asking the SIT to try to perform a resubscribe action that brings the replica to another consumer in the same consumer pool.

Solution

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

Work in progress

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@sixpluszero sixpluszero marked this pull request as draft November 6, 2025 18:15
@sixpluszero sixpluszero force-pushed the conditionBasedResubscribe branch from 642919d to 09ab18a Compare November 20, 2025 04:28
@sixpluszero sixpluszero changed the title [WIP][server] Add a heartbeat lag based replica auto resubscribe feature [server] Add a heartbeat lag based replica auto resubscribe feature Nov 21, 2025
@sixpluszero sixpluszero marked this pull request as ready for review November 21, 2025 18:10
Copy link
Contributor

@haoxu07 haoxu07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for preparing this! Left a few comments.

* We will delegate to KafkaConsumerService to determine whether it takes this request as it has information.
*/
if (getServerConfig().isLagBasedReplicaAutoResubscribeEnabled() && TimeUnit.SECONDS
.toMillis(getServerConfig().getLagBasedReplicaAutoResubscribeThresholdInSeconds()) < lag) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we only concern lag here, should we consider both lag and very few records polled from consumer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it is possible that underlying resubscribe operationh is very slow on waiting for the drainer to be cleared for specific partition, can we add some safeguard here to avoid enqueuing too many partititons into SIT resubscribe queue? We could check the SIT PSC # to prevent put too many partitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline - we have one config to control each replica's resubscribe interval - it will make sure resubscribe behavior does not happen too frequent

@sixpluszero sixpluszero force-pushed the conditionBasedResubscribe branch from d3a410b to a322c10 Compare November 26, 2025 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants